Implement zero-copy tokenization for identifiers, strings, and comments #2136
Draft
eyalleshem wants to merge 2 commits into apache:reduce-string-copying from
This change introduces a lifetime parameter 'a to BorrowedToken enum
to prepare for zero-copy tokenization support. This is a foundational
step toward reducing memory allocations during SQL parsing.
Changes:
- Added lifetime parameter to BorrowedToken<'a> enum
- Added _Phantom(Cow<'a, str>) variant to carry the lifetime
- Implemented Visit and VisitMut traits for Cow<'a, str> to support
the visitor pattern with the new lifetime parameter
- Fixed lifetime issues in visitor tests by using tokenized_owned()
instead of tokenize() where owned tokens are required
- Type alias Token = BorrowedToken<'static> maintains backward
compatibility
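The shape of this change can be illustrated with a minimal sketch. The enum below is a heavily simplified stand-in, not the crate's real definition: only `BorrowedToken<'a>`, the `_Phantom(Cow<'a, str>)` variant, and the `Token` alias come from the commit message, while the `Comma` and `Number` variants and the `into_owned` helper are assumptions made for illustration.

```rust
use std::borrow::Cow;

/// Simplified stand-in for the token enum; the real type has many more variants.
#[derive(Debug, Clone, PartialEq)]
pub enum BorrowedToken<'a> {
    Comma,
    Number(String),
    /// Carries the lifetime parameter while no other variant borrows yet.
    _Phantom(Cow<'a, str>),
}

/// Code that already names `Token` keeps compiling unchanged.
pub type Token = BorrowedToken<'static>;

impl<'a> BorrowedToken<'a> {
    /// Hypothetical helper (an assumption, not taken from the PR):
    /// detach a token from the source buffer it may borrow from.
    pub fn into_owned(self) -> Token {
        match self {
            BorrowedToken::Comma => BorrowedToken::Comma,
            BorrowedToken::Number(n) => BorrowedToken::Number(n),
            BorrowedToken::_Phantom(s) => {
                BorrowedToken::_Phantom(Cow::Owned(s.into_owned()))
            }
        }
    }
}
```

Because `Token` is just `BorrowedToken<'static>`, callers that need tokens to outlive the input can convert once, while callers that only inspect tokens can keep the borrowed form.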
…hitespace
Convert token string fields to use Cow<'a, str> to enable zero-copy
tokenization for commonly used tokens:
- Word.value: Regular identifiers and keywords now borrow from source
- SingleQuotedString: String literals borrow when no escape processing
is needed
- Whitespace: Single-line and multi-line comments borrow from source
Also add a benchmark for measuring tokenization performance
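The "borrow when no escape processing is needed" rule is the main reason to reach for `Cow<'a, str>` rather than a plain `&'a str`. The sketch below shows the idea under stated assumptions: `unescape_single_quoted` is a hypothetical helper, not a function from the PR, and it handles only the doubled-quote escape (`''`), whereas the real tokenizer may handle more escape forms depending on the dialect.

```rust
use std::borrow::Cow;

/// Hypothetical helper: `body` is the text between the quotes of a
/// single-quoted SQL literal, where `''` encodes one quote character.
pub fn unescape_single_quoted(body: &str) -> Cow<'_, str> {
    if !body.contains("''") {
        // Fast path: nothing to rewrite, so hand back a slice of the input.
        return Cow::Borrowed(body);
    }
    // Slow path: the unescaped text differs from the input, so allocate.
    Cow::Owned(body.replace("''", "'"))
}
```

Only literals that actually contain an escape pay for an allocation; all others are returned as a view into the original SQL text.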
This PR implements zero-copy tokenization by using borrowed strings (&str) instead of owned strings (String) for identifiers, string literals, and comments. This eliminates unnecessary string allocations during the tokenization process.
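To make "borrowed instead of owned" concrete, here is a minimal sketch of a word scanner whose output slices point into the input buffer. Both `words` and `borrows_from` are hypothetical names chosen for this example; the PR's tokenizer is considerably more involved.

```rust
/// Hypothetical helper: split `src` into identifier-like words without copying.
pub fn words<'a>(src: &'a str) -> Vec<&'a str> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in src.char_indices() {
        let is_word = c.is_alphanumeric() || c == '_';
        match (is_word, start) {
            // A word begins here.
            (true, None) => start = Some(i),
            // A word just ended: push a slice of the original buffer.
            (false, Some(s)) => {
                out.push(&src[s..i]);
                start = None;
            }
            _ => {}
        }
    }
    // Flush a word that runs to the end of the input.
    if let Some(s) = start {
        out.push(&src[s..]);
    }
    out
}

/// True when `slice` lies entirely inside `src`'s buffer, i.e. it was not copied.
pub fn borrows_from(src: &str, slice: &str) -> bool {
    let base = src.as_ptr() as usize;
    let p = slice.as_ptr() as usize;
    p >= base && p + slice.len() <= base + src.len()
}
```

The trade-off is that every returned slice is tied to the lifetime of `src`, so the input must outlive the tokens unless they are converted to owned strings.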
Changes
Token variants to store &'a str instead of String for:
- Word tokens (identifiers like table/column names)
- SingleQuotedString literals
- Whitespace
to_uppercase() allocation
tokenize_bench criterion benchmark for performance measurement
Performance Impact
Benchmark results using a complex 27KB SQL query with CTEs, joins, window functions, and extensive comments:
tokenization/tokenize_complex_sql
time: [254.68 µs 254.81 µs 254.97 µs]
change: [−60.885% −60.682% −60.482%] (p = 0.00 < 0.05)
Performance has improved.
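As a rough reading of these numbers, the baseline time is not shown in the output but can be estimated from the point estimates alone. This is a back-of-the-envelope approximation, not a figure reported by the benchmark, since criterion derives the change from its own statistics:

```latex
t_{\text{new}} \approx 254.81\,\mu\text{s}, \qquad \Delta \approx -0.60682
\\[4pt]
t_{\text{old}} \approx \frac{t_{\text{new}}}{1 + \Delta}
             = \frac{254.81}{0.39318} \approx 648\,\mu\text{s}
\\[4pt]
\text{speedup} \approx \frac{t_{\text{old}}}{t_{\text{new}}}
             = \frac{1}{0.39318} \approx 2.54\times
```

In other words, a reduction of about 60.7% in time corresponds to tokenizing the same input roughly 2.5 times faster.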